Chapter 13: Cluster Analysis and Factor Analysis
Welcome to the online content for Chapter 13!
As always, I’ll assume that you’ve already read up to this chapter of the book and worked through the online content for the previous chapters. If not, please do that first.
As always, click the ‘Run Code’ buttons below to execute the R code. Remember to wait until they say ‘Run Code’ before you press them. And be careful to run the boxes in order, since later boxes may depend on what earlier ones have done.
Cluster Analysis
Let’s begin by reading in the data set that I used throughout this chapter:
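The code box here presumably looks something like this; the file name ‘survey.csv’ is an assumption, and we call the object ‘survey’ because that’s the name used later on this page:

```r
# Read the chapter's data set into a data frame ("survey.csv" is an assumed file name)
survey <- read.csv("survey.csv")
```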
This data set contains 60 people, so 60 rows of data. With larger data sets, printing out the entire data set can be unwieldy, so we often use the head and tail functions in R instead, to check the top and bottom few lines.
Try these:
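```r
head(survey)  # first six rows by default
tail(survey)  # last six rows by default
```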
For cluster analysis, we need to normalise our survey data, which means standardising each person’s response on each scale by expressing it in terms of standard deviations from the mean. The scale command does this for us.
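A sketch of that step; ‘survey_z’ is just an illustrative name for the standardised data:

```r
# scale() centres each column on its mean and divides by its standard deviation
survey_z <- scale(survey)
```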
Now, we can run a cluster analysis. We’ll use the hclust function to do a hierarchical cluster analysis:
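Something along these lines; note that hclust works on a distance matrix, which dist computes, and the clustering method shown is R’s default (‘complete’), which may differ from the book’s choice:

```r
# Compute distances between people, then cluster hierarchically
h <- hclust(dist(survey_z))
```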
We create an object called ‘h’, which is our cluster analysis, and then we plot that.
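```r
plot(h)  # draw the dendrogram
```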
The dendrogram matches the one in the chapter, with the numbers indicating which person is which.
To do \(\boldsymbol k\)-means clustering, we need to decide how many clusters we want. Let’s try 3.
We use the kmeans function, and the ‘3’ tells R that we want 3 clusters.
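A sketch of the call; k-means starts from random cluster centres, so we set a seed for reproducibility (the book’s seed, if any, isn’t shown here):

```r
set.seed(1)               # arbitrary seed, so the cluster assignments are reproducible
k <- kmeans(survey_z, 3)  # ask for 3 clusters
k                         # print the results
```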
We get a lot of detailed output: the number of people in each cluster (adding up to 60), the cluster centres (means) for each item, and which cluster each person is assigned to.
Cronbach’s alpha
The easiest way to calculate Cronbach’s alpha is to install a specialist package:
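The function we’re about to use, cronbach.alpha, comes from the ltm package, so presumably the installation step looks like this:

```r
install.packages("ltm")
```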
We wait a few moments while the installation takes place. (We may get some messages that we can ignore.) And then we can use the cronbach.alpha function on our data set.
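That is, something like:

```r
library(ltm)            # load the package
cronbach.alpha(survey)  # alpha for the survey items
```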
We get the value 0.59 that I discussed in the chapter.
Principal Components Analysis
We can produce a correlation matrix by using the cor function that we met in Chapter 9:
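```r
cor(survey)  # correlation of each variable with every other variable
```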
Because ‘survey’ is a data frame containing 6 variables, instead of getting a single number we get a table showing the correlation of each variable with every other variable.
We can also obtain the scatterplot matrix using plot:
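```r
plot(survey)  # plot() on a data frame gives a scatterplot matrix
```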
Running a principal components analysis on our data is easy, using the prcomp function:
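A sketch of the call; whether the book standardises the variables first (prcomp’s scale. argument) isn’t shown here, so this uses the defaults:

```r
pca <- prcomp(survey)  # principal components analysis (consider scale. = TRUE)
pca                    # print standard deviations and loadings ('Rotation')
```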
We see the loading for each item onto each principal component. The values for the first two principal components match those I presented in the chapter.
We can use summary(…) on our pca object to get some more information:
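```r
summary(pca)  # proportion of variance accounted for by each component
```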
We see that principal component 1 (PC1) accounts for 0.3411 of the variance, which is 34.1%. PC2 accounts for 30.5%, and PC1 and PC2 together account for the sum of these, which is given in the ‘Cumulative Proportion’ row as 64.6%. (‘Cumulative’ means a running total: the total up to that point.)
If we use plot on our pca object, we get the scree plot, showing the variances accounted for by each component:
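```r
plot(pca)  # scree plot of the component variances
```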
Using plot with $x gives us each participant’s values on PC1 and PC2, so we can see how PC1 and PC2 are related.
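Something like this; pca$x holds each participant’s scores on the components, and we plot the first two:

```r
plot(pca$x[, 1:2])  # PC1 on the x-axis, PC2 on the y-axis
```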
We know that PC1 and PC2 must be uncorrelated, and that fits with the impression given by the scatter plot.
Factor Analysis
Running a factor analysis on our data is also straightforward, using the factanal (‘factor analysis’) function:
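A sketch of the call (‘f’ is just an illustrative name):

```r
f <- factanal(survey, factors = 2)  # maximum-likelihood factor analysis
f                                   # uniquenesses, loadings, chi-squared test
```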
The argument factors=2 tells R to give us a 2-factor model. You could try changing this to other numbers and see what happens.
‘Uniqueness’ tells us how much variance is ‘unique’ to that variable, and not shared with any of the other variables.
The ‘Loadings’ table shows how strongly each item loads onto each of the two factors. We can see that the first factor accounts for 26.1% of the variance and the first two factors together account for 48.3% of the total variance.
The results of the chi-squared test are explained in the chapter.
To perform a rotation, we can include a rotation argument:
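For example:

```r
factanal(survey, factors = 2, rotation = "varimax")
```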
There are other rotation options besides ‘varimax’ (for example, ‘promax’ or ‘none’). In this case, specifying ‘varimax’ makes no difference to the results: varimax is factanal’s default rotation, so it had already been applied, and the factors remain orthogonal.